
[SYCL][CUDA] Implement Intel USM extension #1241

Merged: 3 commits merged into intel:sycl from Implement_part_of_USM_for_CUDA on Mar 14, 2020

Conversation

@fwyzard fwyzard commented Mar 4, 2020

Implement the following functions in the CUDA plugin, and mark the tests for the USM features that are now supported.

Also, fix the CUDA version reported by SYCL.

Device

  • USM-related calls to piDeviceGetInfo()

Kernel

  • piextKernelSetArgPointer()

USM

  • piextUSMHostAlloc()
  • piextUSMDeviceAlloc()
  • piextUSMSharedAlloc()
  • piextUSMFree()
  • piextUSMEnqueueMemset()
  • piextUSMEnqueueMemcpy()
  • piextUSMEnqueuePrefetch()
  • piextUSMEnqueueMemAdvise() (see below)
  • piextUSMGetMemAllocInfo() (see below)

Limitations

As the Intel USM extension is still incomplete:

  • piextUSMEnqueuePrefetch() ignores the "flags" argument;
  • piextUSMEnqueueMemAdvise() does nothing.
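
For context, here is a minimal sketch of how an application exercises these new entry points through the USM API (assuming the SYCL 2020-style spellings; the exact namespaces in the tree at the time may differ):

```cpp
#include <CL/sycl.hpp>
#include <vector>

int main() {
  sycl::queue q{sycl::gpu_selector{}};

  constexpr size_t n = 1024;
  std::vector<int> host(n, 42);

  // malloc_device lowers to piextUSMDeviceAlloc in the plugin.
  int *dev = sycl::malloc_device<int>(n, q);

  // memcpy lowers to piextUSMEnqueueMemcpy.
  q.memcpy(dev, host.data(), n * sizeof(int)).wait();

  // Passing a raw USM pointer to a kernel goes through piextKernelSetArgPointer.
  q.parallel_for(sycl::range<1>{n}, [=](sycl::id<1> i) { dev[i] += 1; }).wait();

  q.memcpy(host.data(), dev, n * sizeof(int)).wait();

  // free lowers to piextUSMFree.
  sycl::free(dev, q);
}
```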

@fwyzard fwyzard changed the title from "[SYCL][CUDA] Implement part of USM for cuda" to "[SYCL][CUDA] Implement part of USM" on Mar 4, 2020
@fwyzard fwyzard force-pushed the Implement_part_of_USM_for_CUDA branch from 99c9d32 to 1af2918 on March 4, 2020 00:47
@bader bader added the cuda CUDA back-end label Mar 4, 2020
@bader bader requested a review from jbrodman March 4, 2020 09:11

bader commented Mar 4, 2020

@fwyzard, thanks a lot for working on this.
@Ruyk, @bjoernknafla, could you take a look, please?

@fwyzard fwyzard force-pushed the Implement_part_of_USM_for_CUDA branch from 1af2918 to d0c1f96 Compare March 4, 2020 13:50

fwyzard commented Mar 4, 2020

Updated to use cuMemcpyAsync directly, instead of dispatching to cuMemcpyHtoDAsync, cuMemcpyDtoHAsync, etc.
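
For reference, with Unified Addressing the driver infers the transfer direction from the pointers themselves, so one call covers host-to-device, device-to-host, and device-to-device. A minimal sketch (the helper name is illustrative, not the plugin's actual code):

```cpp
#include <cuda.h>

// cuMemcpyAsync infers the copy direction from the two pointers under
// Unified Addressing, so there is no need to dispatch to
// cuMemcpyHtoDAsync / cuMemcpyDtoHAsync / cuMemcpyDtoDAsync by hand.
CUresult enqueue_usm_copy(void *dst, const void *src, size_t bytes,
                          CUstream stream) {
  return cuMemcpyAsync(reinterpret_cast<CUdeviceptr>(dst),
                       reinterpret_cast<CUdeviceptr>(const_cast<void *>(src)),
                       bytes, stream);
}
```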

Ruyk previously approved these changes Mar 4, 2020

@Ruyk Ruyk left a comment

Thanks for this! @jbrodman will be happy :-) A few comments here and there.
Have you enabled any of the USM lit tests to see if they pass?

if (event) {
  retImplEv = std::unique_ptr<_pi_event>(
      _pi_event::make_native(PI_COMMAND_MEMBUFFER_COPY, command_queue));

Probably not in the scope of this change, but PI_COMMAND_MEMBUFFER_COPY is used for the non-USM operations. We probably need a different PI_COMMAND_USM_COPY for USM operations, since the OpenCL function is different in the cl_intel_unified_shared_memory extension.

bjoernknafla previously approved these changes Mar 5, 2020
@fwyzard fwyzard dismissed stale reviews from bjoernknafla and Ruyk via 14ce4c4 March 5, 2020 17:57
@fwyzard fwyzard force-pushed the Implement_part_of_USM_for_CUDA branch 4 times, most recently from ae88ebb to 7f60dbf on March 5, 2020 18:44

fwyzard commented Mar 5, 2020

I've implemented piextUSMSharedAlloc() and piextUSMEnqueueMemset(), and hopefully addressed most of the comments.

I still need to look into using cuMemHostAlloc() with the optional write-combine flag instead of cuMemAllocHost().
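
For reference, the trade-off looks roughly like this (a sketch; the helper name is illustrative):

```cpp
#include <cuda.h>

// cuMemAllocHost always returns cached, page-locked host memory.
// cuMemHostAlloc accepts flags: CU_MEMHOSTALLOC_WRITECOMBINED gives
// write-combined memory, which the device can read faster over PCIe,
// but which is very slow for the host to read back.
void *host_alloc(size_t bytes, bool write_combined) {
  void *ptr = nullptr;
  unsigned int flags = write_combined ? CU_MEMHOSTALLOC_WRITECOMBINED : 0;
  return (cuMemHostAlloc(&ptr, bytes, flags) == CUDA_SUCCESS) ? ptr : nullptr;
}
```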

If I missed anything, please mention it (again).

@fwyzard fwyzard force-pushed the Implement_part_of_USM_for_CUDA branch from 7f60dbf to a1e51fc on March 5, 2020 18:51

Ruyk commented Mar 6, 2020

Still looking to see if there are more lit tests passing; the USM tests are marked as "unsupported" for the CUDA backend, so they won't report whether they are passing now.


fwyzard commented Mar 6, 2020

Still looking to see if there are more lit tests passing; the USM tests are marked as "unsupported" for the CUDA backend, so they won't report whether they are passing now.

How do I run the USM tests with the CUDA backend?

I have tried removing the // XFAIL: cuda line from all files under sycl/test/usm/, and running make check-sycl-cuda-usm in the build directory, and I got:

...
getDeviceCount cpu:PI_CUDA
Found available CPU device
getDeviceCount gpu:PI_CUDA
Found available GPU device
getDeviceCount accelerator:PI_CUDA
Could not find AOT device compiler opencl-aot
Found AOT device compiler ocloc
Could not find AOT device compiler aoc

Testing Time: 14.15s
  Expected Passes    : 24

Which is not what I expected, because

  • "Found available CPU device" makes no sense for CUDA
  • "Expected Passes : 24" (i.e. all) is too many, I would have expected same failers, since not all USM functions are implements.


fwyzard commented Mar 6, 2020

In fact, looking at the tests being built and run, it looks like make check-sycl-cuda-usm:

  • still builds with -fsycl-targets=spir64-unknown-linux-sycldevice instead of -fsycl-targets=nvptx64-unknown-linux-sycldevice
  • does not pass SYCL_BE=PI_CUDA at runtime


fwyzard commented Mar 9, 2020

@bashbaug @jbrodman

About the behaviour of piDeviceGetInfo() for PI_USM_HOST_SUPPORT, PI_USM_DEVICE_SUPPORT, PI_USM_SINGLE_SHARED_SUPPORT, PI_USM_CROSS_SHARED_SUPPORT, and PI_USM_SYSTEM_SHARED_SUPPORT: re-reading the OpenCL cl_intel_unified_shared_memory extension, the SYCL USM proposal, the Level Zero Memory documentation and the CUDA driver API for Unified Addressing, it's starting to make sense...

However, it looks like there are still some corner cases that are not very clear.
Is the API finalised?

According to Table 5 in cl_intel_unified_shared_memory:

The host memory access capabilities apply to any host allocation.

so, PI_USM_HOST_SUPPORT queries whether a CUDA device can access memory allocated on the host by cuMemAllocHost (or cuMemHostAlloc), i.e. zero-copy memory with Unified Addressing.


The device memory access capabilities apply to any device allocation associated with this device.

so PI_USM_DEVICE_SUPPORT queries whether a CUDA device can access memory allocated on the device itself with cuMemAlloc, which I would take for granted?

The table under USM Allocations in the USM proposal also mentions, for device allocations, the possibility of peer-to-peer access from other devices. How is that queried?
piDeviceGetInfo() has a single device parameter, so I wouldn't know how to check that.


The single device shared memory access capabilities apply to any shared allocation associated with this device.

so PI_USM_SINGLE_SHARED_SUPPORT queries whether the device supports managed memory, allocated with cuMemAllocManaged.


The cross-device shared memory access capabilities apply to any shared allocation associated with this device, or to any shared memory allocation on another device that also supports the same cross-device shared memory access capability.

so PI_USM_CROSS_SHARED_SUPPORT queries whether the device supports peer-to-peer access to managed memory.


The shared system memory access capabilities apply to any allocations made by a system allocator, such as malloc or new.

so PI_USM_SYSTEM_SHARED_SUPPORT queries whether the device can access host memory allocated by the system allocator (e.g. malloc() and new); for CUDA devices this is only available on Power9 machines as far as I know, so I have no way to test it.
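
Summarising, the three USM allocation kinds map onto the CUDA driver API roughly as follows (an illustrative sketch, not the plugin code itself):

```cpp
#include <cuda.h>

// Host USM: page-locked host memory, device-accessible via Unified
// Addressing (zero-copy); backs piextUSMHostAlloc.
CUresult usm_host_alloc(void **ptr, size_t size) {
  return cuMemAllocHost(ptr, size);
}

// Device USM: ordinary device memory; backs piextUSMDeviceAlloc.
CUresult usm_device_alloc(CUdeviceptr *ptr, size_t size) {
  return cuMemAlloc(ptr, size);
}

// Shared USM: managed memory that migrates between host and device;
// backs piextUSMSharedAlloc.
CUresult usm_shared_alloc(CUdeviceptr *ptr, size_t size) {
  return cuMemAllocManaged(ptr, size, CU_MEM_ATTACH_GLOBAL);
}
```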

@fwyzard fwyzard force-pushed the Implement_part_of_USM_for_CUDA branch 2 times, most recently from 89fd6ab to edf470d on March 9, 2020 19:36
@romanovvlad romanovvlad assigned jbrodman and unassigned romanovvlad Mar 10, 2020
@jbrodman

@bashbaug @jbrodman

About the behaviour of piDeviceGetInfo() for PI_USM_HOST_SUPPORT, PI_USM_DEVICE_SUPPORT, PI_USM_SINGLE_SHARED_SUPPORT, PI_USM_CROSS_SHARED_SUPPORT, and PI_USM_SYSTEM_SHARED_SUPPORT: re-reading the OpenCL cl_intel_unified_shared_memory extension, the SYCL USM proposal, the Level Zero Memory documentation and the CUDA driver API for Unified Addressing, it's starting to make sense...

However, it looks like there are still some corner cases that are not very clear.
Is the API finalised?

No. It's mostly complete, but there are still things that might change as we get more experience and usage.

According to Table 5 in cl_intel_unified_shared_memory:

The host memory access capabilities apply to any host allocation.

so, PI_USM_HOST_SUPPORT queries whether a CUDA device can access memory allocated on the host by cuMemAllocHost (or cuMemHostAlloc), i.e. zero-copy memory with Unified Addressing.

Yes.

The device memory access capabilities apply to any device allocation associated with this device.

so PI_USM_DEVICE_SUPPORT queries whether a CUDA device can access memory allocated on the device itself with cuMemAlloc, which I would take for granted?

Yes.

The table under USM Allocations in the USM proposal also mentions, for device allocations, the possibility of peer-to-peer access from other devices. How is that queried?

TBD.

The single device shared memory access capabilities apply to any shared allocation associated with this device.

so PI_USM_SINGLE_SHARED_SUPPORT queries whether the device supports managed memory, allocated with cuMemAllocManaged.

Correct.

The cross-device shared memory access capabilities apply to any shared allocation associated with this device, or to any shared memory allocation on another device that also supports the same cross-device shared memory access capability.

so PI_USM_CROSS_SHARED_SUPPORT queries whether the device supports peer-to-peer access to managed memory.

Not entirely. P2P is kind of a special case. This is more about whether one shared allocation is able to migrate between different devices in a platform. P2P is really more about whether it can migrate directly without needing to go through the host.

The shared system memory access capabilities apply to any allocations made by a system allocator, such as malloc or new.

so PI_USM_SYSTEM_SHARED_SUPPORT queries whether the device can access host memory allocated by the system allocator (e.g. malloc() and new); for CUDA devices this is only available on Power9 machines as far as I know, so I have no way to test it.

Correct. I don't expect this to really be done until multiple vendors implement the same protocol for coherent device attach.

jbrodman previously approved these changes Mar 10, 2020

@jbrodman jbrodman left a comment

LGTM. Thanks!


fwyzard commented Mar 10, 2020

@jbrodman thank you very much for the clarifications.

PI_USM_SINGLE_SHARED_SUPPORT queries whether the device supports managed memory, allocated with cuMemAllocManaged.

Correct.

PI_USM_CROSS_SHARED_SUPPORT queries whether the device supports peer-to-peer access to managed memory.

Not entirely. P2P is kind of a special case. This is more about whether one shared allocation is able to migrate between different devices in a platform. P2P is really more about whether it can migrate directly without needing to go through the host.

I see.
Then I think I should slightly change the answers to the PI_USM_SINGLE_SHARED_SUPPORT and PI_USM_CROSS_SHARED_SUPPORT queries to reflect this.
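
Something along these lines, answering the queries from the device attributes the driver exposes (a sketch; the attribute choices are my assumptions based on the discussion above, not the final plugin code):

```cpp
#include <cuda.h>

// PI_USM_SINGLE_SHARED_SUPPORT: does the device support managed
// (shared) allocations at all?
bool single_shared_support(CUdevice dev) {
  int v = 0;
  cuDeviceGetAttribute(&v, CU_DEVICE_ATTRIBUTE_MANAGED_MEMORY, dev);
  return v != 0;
}

// PI_USM_CROSS_SHARED_SUPPORT: assumption - concurrent managed access
// as a proxy for shared allocations being able to migrate freely.
bool cross_shared_support(CUdevice dev) {
  int v = 0;
  cuDeviceGetAttribute(&v, CU_DEVICE_ATTRIBUTE_CONCURRENT_MANAGED_ACCESS, dev);
  return v != 0;
}

// PI_USM_SYSTEM_SHARED_SUPPORT: can the device access pageable host
// memory from malloc/new (e.g. on Power9 systems)?
bool system_shared_support(CUdevice dev) {
  int v = 0;
  cuDeviceGetAttribute(&v, CU_DEVICE_ATTRIBUTE_PAGEABLE_MEMORY_ACCESS, dev);
  return v != 0;
}
```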

@fwyzard fwyzard requested review from bader and jbrodman March 13, 2020 21:48
@fwyzard fwyzard force-pushed the Implement_part_of_USM_for_CUDA branch from 59316da to da4eebb on March 13, 2020 21:58

bader commented Mar 14, 2020

@fwyzard the latest commit is not signed, and it looks like it has been squashed.
Could you also update the branch, please? I merged a conflicting PR: #1299.

Other than that I don't have any other comments to this patch. Thanks!

fwyzard added 3 commits March 14, 2020 07:33
Signed-off-by: Andrea Bocci <andrea.bocci@cern.ch>
Device
  - USM-related calls to piDeviceGetInfo

Kernel
  - piextKernelSetArgPointer

USM
  - piextUSMHostAlloc
  - piextUSMDeviceAlloc
  - piextUSMSharedAlloc
  - piextUSMFree
  - piextUSMEnqueueMemset
  - piextUSMEnqueueMemcpy
  - piextUSMEnqueuePrefetch (*)
  - piextUSMEnqueueMemAdvise (*)
  - piextUSMGetMemAllocInfo

(*) due to the incomplete documentation of the USM extension:
  - piextUSMEnqueuePrefetch ignores the "flags" argument;
  - piextUSMEnqueueMemAdvise does nothing.

Signed-off-by: Andrea Bocci <andrea.bocci@cern.ch>
Signed-off-by: Andrea Bocci <andrea.bocci@cern.ch>
@fwyzard fwyzard force-pushed the Implement_part_of_USM_for_CUDA branch from 7039ac0 to 8f3d479 on March 14, 2020 06:47

fwyzard commented Mar 14, 2020

Fixed the commit history and rebased on top of #1310.

@bjoernknafla

#1310 has been merged.

@fwyzard did you manage to get the LIT tests to work for CUDA? I am not sure (haven’t tested) if your SYCL_BE env var fix would be enough or if you need the more extensive fix of the get-device-count-by-type tool for it?

@bader bader merged commit 498d56c into intel:sycl Mar 14, 2020

fwyzard commented Mar 14, 2020

@fwyzard did you manage to get the LIT tests to work for CUDA? I am not sure (haven’t tested) if your SYCL_BE env var fix would be enough or if you need the more extensive fix of the get-device-count-by-type tool for it?

I'm actually testing with all these PRs on top of the sycl branch, but I haven't gotten around to trying the LIT tests yet.

@fwyzard fwyzard deleted the Implement_part_of_USM_for_CUDA branch March 14, 2020 16:12